Signature Trees for Signature Files



Yangjun Chen 

Dept. Business Computing, Winnipeg University 
515 Portage Ave. Winnipeg, Manitoba, Canada, R3B 2E9 

Abstract The signature file method is a popular indexing technique used in information retrieval
and databases. It excels in efficient index maintenance and lower space overhead. However,
it suffers from inefficiency in query processing due to the fact that for each query processed the
entire signature file needs to be scanned. In this paper, we introduce a tree structure, called a
signature tree, established over a signature file, which can be used to expedite the signature file
scanning by one order of magnitude or more.

Keywords: index, signature file, signature identifier, signature tree, information retrieval 
1. Introduction 

An important question in information retrieval is how to create a database index which can be
searched efficiently for the data one seeks. Today, one or more of the following three techniques
are frequently used: full text searching, inversion and the signature file. Full text searching
imposes no space overhead, but requires a long response time. In contrast, inversion and the
signature file work quickly, but need a large intermediary representation structure (index),
which provides direct links to relevant data.

The inverted index excels in query processing efficiency. It is a set of postings lists [HEBL92],
each of which maps one keyword to a list of links to the data entries containing that keyword.
Inverted indices can be implemented as sorted arrays, tries, B-trees and various hashing
structures, whereby each real text block address (or document identifier) is stored more than once.
The scheme needs to be re-organized frequently under intensive insertion and update workloads.
Recently, a lot of work has been done on the encoding of postings lists in the context of document
databases [MZ96, ZMR98]. Using Golomb's encoding for the integers [Go66], the size of the inverted
index can be reduced to 14% of the indexed data with little or no loss of retrieval effectiveness
[ZMR98]. However, Golomb's encoding cannot be utilized in some applications. For instance, in an
object-oriented database system, if the inverted index is used, the postings list will be a series
of pairs of the form (C, oid), where C represents a class name and oid represents an object
identifier, which does not satisfy the encoding condition. Therefore, in the context of
object-oriented databases, the inverted file will require much storage space for postings lists
[Ca75, Ha81].

The signature file method was originally introduced as a text indexing methodology [Fa85,
FLPS90]. Nowadays, however, it is utilized in a wide range of applications, such as office
filing [CTHP86], hypertext systems [FLPS90], relational and object-oriented databases [CS89,
IKO93, LL92, SKRT95, YLK94], as well as data mining [AB97]. Compared to the inverted
index, the signature file is more efficient in handling new insertions and queries on parts of
words. But the scheme introduces information loss. More specifically, its output usually involves
a number of false drops, which can only be identified by a full text scan of every text block
short-listed in the output. Also, for each query processed, the entire signature file needs to be
searched [CF84, Fa85, Fa92]. Consequently, the signature file method involves high processing
and I/O cost. This problem is mitigated by partitioning the signature file, as well as by
exploiting parallel computer architectures [CZ96, Le95, SK86].

During the creation of a signature file, each word is processed separately by a hashing function.
The scheme sets a constant number (m) of 1s in the [1..F] range. The resulting binary pattern
is called the word signature. Each text is considered to consist of fixed-size logical blocks, and
each block involves a constant number (D) of non-common, distinct words. The D word signatures
of a block are superimposed (bitwise OR-ed) to produce a single F-bit pattern, which is the block
signature stored as an entry in the signature file.

Fig. 1 depicts the signature generation and comparison process of a block containing three
words (hence D = 3), say "SGML", "database", and "information". Each signature is of length
F = 12, in which m = 4 bits are set to 1. When a query arrives, the block signatures are scanned
and many nonqualifying blocks are discarded. The rest are either checked (so that the "false
drops" are discarded; see below) or returned to the user as they are. Concretely, a query
specifying certain values to be searched for will be transformed into a query signature $s_q$ in
the same way as for word signatures. The query signature is then compared to every block
signature in the signature file. Three possible outcomes of the comparison are exemplified in
Fig. 1: (1) the block matches the query; that is, for every bit set in $s_q$ the corresponding bit
in the block signature $s$ is also set (i.e., $s \wedge s_q = s_q$) and the block really contains
the query word; (2) the block does not match the query (i.e., $s \wedge s_q \neq s_q$); and (3) the
signature comparison indicates a match but the block in fact does not match the search criteria
(false drop). In order to eliminate false drops, the block must be examined after the block
signature signifies a successful match.

block: ... SGML ... databases ... information ...

word signatures:
    SGML                     010 000 100 110
    database                 100 010 010 100
    information           v  010 100 011 000
    object signature (OS)    110 110 111 110

queries        query signatures     matching results
    SGML          010 000 100 110      match with OS
    XML           011 000 100 100      no match with OS
    informatik    110 100 100 000      false drop

Fig. 1. Signature generation and comparison
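
The superimposed coding and the matching test of Fig. 1 can be sketched in a few lines of Python.
This is only an illustration under our own assumptions: signatures are held as bit strings whose
leftmost character is bit position 1, the word signatures are copied from Fig. 1 rather than
produced by a hashing function, and the helper names (`block_signature`, `matches`) are
hypothetical.

```python
from functools import reduce

F = 12  # signature length used in Fig. 1

# Word signatures copied from Fig. 1 (not computed by a hash function here).
WORD_SIGNATURES = {
    "SGML": "010000100110",
    "database": "100010010100",
    "information": "010100011000",
}


def to_int(bits: str) -> int:
    """Interpret a bit string as an integer so that bitwise operators apply."""
    return int(bits, 2)


def block_signature(word_sigs) -> int:
    """Superimpose (bitwise OR) the word signatures of one block."""
    return reduce(lambda a, b: a | b, (to_int(s) for s in word_sigs), 0)


def matches(block_sig: int, query_sig: int) -> bool:
    """Signature test: every bit set in the query must also be set in the block."""
    return block_sig & query_sig == query_sig


os_sig = block_signature(WORD_SIGNATURES.values())
print(format(os_sig, f"0{F}b"))                  # object signature (OS)
print(matches(os_sig, to_int("010000100110")))   # query "SGML"
print(matches(os_sig, to_int("011000100100")))   # query "XML"
print(matches(os_sig, to_int("110100100000")))   # query "informatik" (false drop)
```

Note that the test alone cannot tell the third query from a real match; only an examination of the
block itself reveals the false drop.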



In this paper, we propose a method to speed up the (sequential) signature file scanning by
establishing a tree structure over it, called a signature tree, just like a position tree for a
text [AHU74]. However, in the construction of a position tree, a position identifier is a
contiguous sequence of characters, while in the construction of a signature tree a signature
identifier is not a contiguous piece of a bit string.

A closely related work is the S-tree proposed in [De86]. It is in fact a B-tree built over a
signature file. Thus, it can be used to speed up the location of a signature in a signature file
just like a B-tree for primary keys in a relational database. In the signature tree, however, each
path corresponds to a signature identifier, which can be used to uniquely identify the
corresponding signature in a signature file. It helps to find the set of signatures matching a
query signature quickly.

Signature files can also be utilized as a set access facility in OODBSs [IKO93]. In particular,
according to the analysis of [IKO93], the bit-sliced signature file (BSSF) outperforms the
sequential signature file (SSF) by almost 50% (in time cost) in the best case. However, the
storage cost of BSSF is double that of SSF, and the update cost of BSSF is triple that of SSF or
more [IKO93]. Later on, we will see that the signature tree has a much better time complexity
and a lower update cost than BSSF, with almost the same storage cost.

2. Signature trees 

A first idea to improve the performance is to sort the signature file and then employ a binary
search. Unfortunately, this does not work, because a signature file is only an inexact filter.
The following example illustrates the problem.

Consider a sorted signature file containing only three signatures: 

010 000 100 110 
010 100 011 000 
100 010 010 100 

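Assume that the query signature $s_q$ is equal to 000 010 010 100. It matches 100 010 010 100.
However, if we use a binary search, 100 010 010 100 cannot be found: $s_q$ is smaller than the
middle entry 010 100 011 000, so the search continues in the lower half of the file, while the
matching signature lies in the upper half.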
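
The failure can be reproduced with a short Python sketch. It assumes, for illustration only, that
signatures are stored as equal-length bit strings (so string order equals numeric order) and uses
the standard `bisect` module for the binary search; the function name `matches` is ours.

```python
from bisect import bisect_left

# The sorted signature file and the query from the example above.
sorted_file = ["010000100110", "010100011000", "100010010100"]
query = "000010010100"


def matches(sig: str, q: str) -> bool:
    """Inexact (inclusion) test: all bits set in q are also set in sig."""
    return int(sig, 2) & int(q, 2) == int(q, 2)


# A binary search looks for the position of the query in the sort order
# and can only report an entry that is equal to the query.
pos = bisect_left(sorted_file, query)
found_by_binary_search = pos < len(sorted_file) and sorted_file[pos] == query

# A sequential scan applies the inclusion test to every entry.
found_by_scan = [s for s in sorted_file if matches(s, query)]

print(pos, found_by_binary_search)
print(found_by_scan)
```

The sort order says nothing about bit inclusion, which is why the matching signature ends up at
the opposite end of the file from the position the binary search inspects.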

For this reason, we try another method and construct a signature tree to avoid scanning a sig- 
nature file completely. 

2.1 Definition of signature trees 

Consider a signature $s_i$ of length F. We denote it as $s_i = s_i[1] s_i[2] \ldots s_i[F]$, where
each $s_i[j] \in \{0, 1\}$ ($j = 1, \ldots, F$). We also use $s_i(j_1, \ldots, j_h)$ to denote a
sequence of pairs w.r.t. $s_i$: $(j_1, s_i[j_1])(j_2, s_i[j_2]) \ldots (j_h, s_i[j_h])$, where
$1 \le j_k \le F$ for $k \in \{1, \ldots, h\}$.

Definition 1 (signature identifier) Let $S = s_1.s_2 \ldots .s_n$ denote a signature file. Consider
$s_i$ ($1 \le i \le n$). If there exists a sequence $j_1, \ldots, j_h$ such that for any $k \neq i$
($1 \le k \le n$) we have $s_i(j_1, \ldots, j_h) \neq s_k(j_1, \ldots, j_h)$, then we say
$s_i(j_1, \ldots, j_h)$ identifies the signature $s_i$, or say $s_i(j_1, \ldots, j_h)$ is an
identifier of $s_i$.

Definition 2 (signature tree) A signature tree for a signature file $S = s_1.s_2 \ldots .s_n$, where
$s_i \neq s_j$ for $i \neq j$ and $|s_k| = F$ for $k = 1, \ldots, n$, is a binary tree T such that

1. For each internal node of T, the left edge leaving it is always labeled with 0 and the right
edge is always labeled with 1.

2. T has n leaves labeled 1, 2, ..., n, used as pointers to n different positions of $s_1$, $s_2$,
... and $s_n$ in S (signature file). For a leaf node u, p(u) represents the pointer to the
corresponding signature in S.

3. Each internal node v is associated with a number, denoted sk(v), which is the bit offset of a
given bit position in the block signature pattern. That bit position will be checked when v
is encountered.

4. Let $j_1, \ldots, j_h$ be the numbers associated with the nodes on a path from the root to a
leaf node labeled i (then, this leaf node is a pointer to the i-th signature in S). Let
$p_1, \ldots, p_h$ be the sequence of labels of edges on this path. Then,
$(j_1, p_1) \ldots (j_h, p_h)$ makes up a signature identifier for $s_i$, $s_i(j_1, \ldots, j_h)$.

Example 1. In Fig. 2(b), we show a signature tree for the signature file shown in Fig. 2(a). In
this signature tree, each edge is labeled with 0 or 1 and each leaf node is a pointer to a signature
in the signature file. In addition, each internal node is associated with a positive integer (which
is used to tell how many bits to skip when searching). Consider the path going through the
nodes marked 1, 7 and 4. If this path is searched for locating some signature s, then three bits
of s: s[1], s[7] and s[4] will have been checked at that moment. If s[4] = 1, the search will go
to the right child of the node marked "4". This child node is marked with 5 and then the 5th bit
of s: s[5] will be checked.

See the path consisting of the dashed edges in Fig. 2(b), which corresponds to the identifier of
$s_6$: $s_6(1, 7, 4, 5) = (1, 0)(7, 1)(4, 1)(5, 1)$. Similarly, the identifier of $s_3$ is
$s_3(1, 4) = (1, 1)(4, 1)$ (see the path consisting of the thick edges in Fig. 2(b)).
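
Definition 1 and the two identifiers of Example 1 can be checked mechanically. The sketch below is
an illustration only: it assumes the eight signatures of Fig. 2(a) are listed in the order
$s_1, \ldots, s_8$, stores them as bit strings with 1-based positions, and uses the hypothetical
helper names `project` and `is_identifier`.

```python
# Signatures of Fig. 2(a), assumed to be listed in the order s1..s8.
SIGNATURES = [
    "011001000101",  # s1
    "111011001111",  # s2
    "111101010111",  # s3
    "011001101111",  # s4
    "011101110101",  # s5
    "011111110101",  # s6
    "011001111111",  # s7
    "111011111111",  # s8
]


def project(sig: str, positions):
    """Return the pairs (j, s[j]) for the given 1-based bit positions j."""
    return tuple((j, int(sig[j - 1])) for j in positions)


def is_identifier(i: int, positions, signatures=SIGNATURES) -> bool:
    """Definition 1: the projection of s_i differs from that of every other s_k."""
    target = project(signatures[i - 1], positions)
    return all(
        project(s, positions) != target
        for k, s in enumerate(signatures, start=1)
        if k != i
    )


print(project(SIGNATURES[5], (1, 7, 4, 5)), is_identifier(6, (1, 7, 4, 5)))
print(project(SIGNATURES[2], (1, 4)), is_identifier(3, (1, 4)))
print(is_identifier(6, (1, 7, 4)))  # s5 agrees with s6 on bits 1, 7 and 4
```

The last call shows why the fourth position is needed for $s_6$: without bit 5 it cannot be told
apart from $s_5$.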



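$s_1$:  011 001 000 101
$s_2$:  111 011 001 111
$s_3$:  111 101 010 111
$s_4$:  011 001 101 111
$s_5$:  011 101 110 101
$s_6$:  011 111 110 101
$s_7$:  011 001 111 111
$s_8$:  111 011 111 111

(a)                    (b) [drawing of the signature tree not reproduced]

Fig. 2. Signature tree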

In the next subsection, we discuss how to construct such a signature tree for a signature file in 
great detail. 

2.2 Construction of signature trees 

Below we give an algorithm to construct a signature tree for a signature file, which needs only
$O(N)$ time, where N represents the number of signatures in the signature file.

At the very beginning, the tree contains an initial node: a node containing a pointer to the first
signature.

Then, we take the next signature to be inserted into the tree. Let s be the next signature we wish
to enter. We traverse the tree from the root. Let v be the node encountered and assume that v is
an internal node with sk(v) = i. Then, s[i] will be checked. If s[i] = 0, we go left. Otherwise, we
go right. If v is a leaf node, we compare s with the signature $s_0$ pointed to by v. s cannot be
the same as $s_0$, since no two signatures in S are identical. But several leading bits of s may
agree with $s_0$. Assume that the first k bits of s agree with $s_0$, but s differs from $s_0$ in
the (k + 1)th position, where s has the digit b and $s_0$ has 1 - b. We construct a new node u with
sk(u) = k + 1 and replace v with u. (Note that v will not be removed. By "replace", we mean that
the position of v in the tree is occupied by u; v will become one of u's children.) If b = 1, we
make v and the pointer to s the left and right children of u, respectively. If b = 0, we make v and
the pointer to s the right and left children of u, respectively.

The following is the formal description of the algorithm. 

Algorithm sig-tree-generation(file)
begin
    construct a root node r with sk(r) = 1;
        /* where r corresponds to the first signature s_1 in the signature file */
    for j = 2 to n do
        call insert(s_j);
end

Procedure insert(s)
begin
    stack <- root;
    while stack not empty do
1   { v <- pop(stack);
2     if v is not a leaf then
3     { i <- sk(v);
4       if s[i] = 1 then {let a be the right child of v; push(stack, a);}
5       else {let a be the left child of v; push(stack, a);}
6     }
7     else (* v is a leaf *)
8     { compare s with the signature s_0 pointed by p(v);
9       assume that the first k bits of s agree with s_0;
10      but s differs from s_0 in the (k + 1)th position;
11      w <- v; replace v with a new node u with sk(u) = k + 1;
12      if s[k + 1] = 1 then make s and w be respectively the right and left children of u
13      else make w and s be respectively the right and left children of u; }
14  }
end

In the procedure insert, stack is a stack structure used to control the tree traversal. 
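
The construction can also be written as a small runnable Python sketch. It is not the paper's
implementation: it starts from a single leaf for the first signature (as in the prose description),
walks down with a plain loop instead of an explicit stack, and uses the hypothetical names `Leaf`,
`Internal`, `insert`, `build_signature_tree` and `identifier`. Signatures are assumed to be distinct
bit strings of equal length; the data are those of Fig. 2(a).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    index: int  # 1-based position of the signature in the file, i.e. p(v)


@dataclass
class Internal:
    sk: int  # 1-based bit position checked at this node
    left: Union["Internal", Leaf]
    right: Union["Internal", Leaf]


def first_difference(a: str, b: str) -> int:
    """1-based position of the first bit in which a and b differ (a != b assumed)."""
    return next(j for j, (x, y) in enumerate(zip(a, b), start=1) if x != y)


def insert(root, signatures, idx):
    """Insert signature number idx and return the (possibly new) root."""
    s = signatures[idx - 1]
    parent, went_right, node = None, False, root
    while isinstance(node, Internal):          # follow the bits of s down to a leaf
        parent, went_right = node, s[node.sk - 1] == "1"
        node = node.right if went_right else node.left
    pos = first_difference(s, signatures[node.index - 1])
    new_leaf = Leaf(idx)
    if s[pos - 1] == "1":                      # s goes right, the old leaf goes left
        u = Internal(pos, left=node, right=new_leaf)
    else:                                      # s goes left, the old leaf goes right
        u = Internal(pos, left=new_leaf, right=node)
    if parent is None:
        return u
    if went_right:
        parent.right = u
    else:
        parent.left = u
    return root


def build_signature_tree(signatures):
    root = Leaf(1)
    for idx in range(2, len(signatures) + 1):
        root = insert(root, signatures, idx)
    return root


def identifier(root, signatures, idx):
    """Pairs (sk, edge label) on the path from the root to the leaf of signature idx."""
    s, node, pairs = signatures[idx - 1], root, []
    while isinstance(node, Internal):
        bit = int(s[node.sk - 1])
        pairs.append((node.sk, bit))
        node = node.right if bit else node.left
    return pairs


SIGNATURES = ["011001000101", "111011001111", "111101010111", "011001101111",
              "011101110101", "011111110101", "011001111111", "111011111111"]
tree = build_signature_tree(SIGNATURES)
print(identifier(tree, SIGNATURES, 6))
print(identifier(tree, SIGNATURES, 3))
```

For the signatures of Fig. 2(a), the two printed paths agree with the identifiers of $s_6$ and
$s_3$ given in Example 1.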



Below we trace the above algorithm against the signature file shown in Fig. 2(a). 




Fig. 3. Sample trace of signature tree generation

In the following, we prove the correctness of the algorithm sig-tree-generation. To this end, it
should be specified that each path from the root to a leaf node in a signature tree corresponds
to a signature identifier. We have the following proposition.

Proposition 1. Let T be a signature tree for a signature file S. Let
$P = v_1.e_1 \ldots v_{g-1}.e_{g-1}.v_g$ be a path in T from the root to a leaf node for some
signature s in S, i.e., $p(v_g) = s$. Denote $j_i = sk(v_i)$ ($i = 1, \ldots, g-1$). Then,
$s(j_1, j_2, \ldots, j_{g-1}) = (j_1, b(e_1)) \ldots (j_{g-1}, b(e_{g-1}))$ constitutes an identifier
for s, where $b(e)$ denotes the label (0 or 1) of edge e.

Proof. Let $S = s_1.s_2 \ldots .s_n$ be a signature file and T a signature tree for it. Let
$P = v_1 e_1 \ldots v_{g-1} e_{g-1} v_g$ be a path from the root to a leaf node for $s_i$ in T.
Assume that there exists another signature $s_t$ such that
$s_t(j_1, j_2, \ldots, j_{g-1}) = s_i(j_1, j_2, \ldots, j_{g-1})$, where $j_l = sk(v_l)$
($l = 1, \ldots, g-1$). Without loss of generality, assume that t > i. Then, at the moment when
$s_t$ is inserted into T, two new nodes v and v' will be inserted as shown in Fig. 4(a) or (b)
(see lines 8 - 13 of the procedure insert). Here, v' is a pointer to $s_t$ and v is associated
with a number indicating the position where $p(v_g)$ and $p(v')$ differ.




Fig. 4. Inserting a node v into T 

It shows that the path for $s_i$ should be $v_1.e_1 \ldots v_{g-1}.e.v.e'.v_g$ or
$v_1.e_1 \ldots v_{g-1}.e.v.e''.v_g$, which contradicts the assumption. Therefore, there is no
other signature $s_t$ with $s_t(j_1, j_2, \ldots, j_{g-1}) = (j_1, b(e_1)) \ldots (j_{g-1}, b(e_{g-1}))$.
So $s_i(j_1, j_2, \ldots, j_{g-1})$ is an identifier of $s_i$. □

The analysis of the time complexity of the algorithm is relatively simple. From the algorithm
sig-tree-generation, we see that there is only one loop to insert all signatures of a signature
file into a tree. At each step within the loop, only one path is searched, which needs at most
$O(F)$ time. Treating the signature length F as a constant, we have the following proposition.

Proposition 2. The time complexity of the algorithm sig-tree-generation is bounded by $O(N)$,
where N represents the number of signatures in a signature file.

Proof. See the above analysis. □

3. Searching and maintenance of signature trees 

In this section, we discuss the searching and maintenance of signature trees. 
3.1 Searching a signature tree 

Now we discuss how to search a signature tree to model the behavior of a signature file as a
filter. Let $s_q$ be a query signature. The i-th position of $s_q$ is denoted as $s_q(i)$. During
the traversal of a signature tree, the inexact matching is defined as follows:

(i) Let v be the node encountered and $s_q(i)$ be the position to be checked.

(ii) If $s_q(i) = 1$, we move to the right child of v.

(iii) If $s_q(i) = 0$, both the right and left child of v will be visited.

In fact, this definition just corresponds to the signature matching criterion.

To implement this inexact matching strategy, we search the signature tree in a depth-first manner
and maintain a stack structure $stack_p$ to control the tree traversal.

Algorithm signature-tree-search
input: a query signature $s_q$;
output: set of signatures which survive the checking;

1. S <- ∅.

2. Push the root of the signature tree into $stack_p$.

3. If $stack_p$ is not empty, v <- pop($stack_p$); else return(S).

4. If v is not a leaf node, i <- sk(v);
   If $s_q(i) = 0$, push $c_r$ and $c_l$ into $stack_p$ (where $c_r$ and $c_l$ are v's right and
   left child, respectively); otherwise, push only $c_r$ into $stack_p$.

5. If v is a leaf node, compare $s_q$ with the signature pointed by p(v).
   /* p(v) - pointer to the block signature */
   If $s_q$ matches, S <- S ∪ {p(v)}.

6. Go to (3).
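
A minimal Python sketch of this search follows. Several things are our assumptions rather than
the paper's: the tree is written as nested tuples `(sk, left, right)` with integers as leaves, and
the concrete tree is the one obtained by tracing the insertion procedure of Section 2.2 by hand
over the signatures of Fig. 2(a), which may differ in detail from the drawing in Fig. 2(b). The
function name `search` is hypothetical.

```python
# Signatures of Fig. 2(a), assumed to be listed in the order s1..s8.
SIGNATURES = ["011001000101", "111011001111", "111101010111", "011001101111",
              "011101110101", "011111110101", "011001111111", "111011111111"]

# Internal node: (sk, left, right); leaf: 1-based index of a signature.
# Hand-traced result of inserting s1..s8 in order (an assumption, see above).
TREE = (1,
        (7, 1, (4, (8, 4, 7), (5, 5, 6))),
        (4, (7, 2, 8), 3))


def search(tree, signatures, query: str):
    """Depth-first inexact search; returns (matching indices, number of leaves checked)."""
    q = int(query, 2)
    result, leaves_checked = [], 0
    stack = [tree]
    while stack:
        v = stack.pop()
        if isinstance(v, tuple):              # internal node
            sk, left, right = v
            stack.append(right)               # the right child is always visited
            if query[sk - 1] == "0":          # a 0 bit in the query rules nothing out
                stack.append(left)
        else:                                 # leaf: full signature comparison
            leaves_checked += 1
            if int(signatures[v - 1], 2) & q == q:
                result.append(v)
    return sorted(result), leaves_checked


hits, checked = search(TREE, SIGNATURES, "000100100000")
print(hits, checked)
```

Only the leaves that are reached undergo a full comparison; the remaining signatures are never
read, which is where the saving over a sequential scan comes from.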

The following example illustrates the main idea of the algorithm.

Example 2 Consider the signature file and the signature tree shown in Fig. 2 once again. 




Fig. 5. Signature tree search 

Assume $s_q$ = 000 100 100 000. Then, only part of the signature tree (marked with thick edges
in Fig. 5) will be searched. On reaching a leaf node, the signature pointed to by the leaf node
will be checked against $s_q$. Obviously, this process is much more efficient than a sequential
search. For this example, only 42 bits are checked (6 bits during the tree search and 36 bits
during the signature checking), whereas a scan of the signature file checks 96 bits. In
general, if a signature file contains N signatures, the method discussed above requires only
$O(N/2^l)$ comparisons in the worst case, where l represents the number of bits set in $s_q$, since
each bit set in $s_q$ will prohibit half of a subtree from being visited. Compared to the time
complexity $O(N)$ of the signature file scanning, this is a major benefit. We will discuss this
issue in more detail in the next section.

3.2 Maintenance of a signature tree 

When a signature s is added to a signature file, the corresponding signature tree can be changed
easily by running the procedure insert() once with s as the input (see Section 2.2).

When a signature is removed from the signature file, we need to reconstruct the corresponding
signature tree as follows:

(i) Let z, u, v, and w be the nodes as shown in Fig. 6(a) and assume that v is a pointer to
the signature to be removed.





(a) (b) 

Fig. 6. Illustration for deleting a signature 

(ii) Remove u and v. Set the left pointer of z to w. (If u is the right child of z, set the right 
pointer of z to w.) 

The resulting signature tree is as shown in Fig. 6(b). 
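
The deletion step can be sketched in Python as follows. The sketch makes assumptions that the
text does not state: the leaf v is located by following the bits of the signature to be removed,
the case where u is the root (no node z exists) is handled by promoting w to the root, and the
names `Node` and `delete` are hypothetical.

```python
class Node:
    """Internal node (sk, left, right) or leaf (index) of a signature tree."""

    def __init__(self, sk=None, left=None, right=None, index=None):
        self.sk, self.left, self.right, self.index = sk, left, right, index

    @property
    def is_leaf(self) -> bool:
        return self.index is not None


def delete(root, signature: str, index: int):
    """Remove the leaf for `signature`; return the root of the resulting tree."""
    z, u, v = None, None, root
    while not v.is_leaf:                       # walk down along the bits of the signature
        z, u = u, v
        v = v.right if signature[v.sk - 1] == "1" else v.left
    if v.index != index:
        raise KeyError(f"signature {index} is not in the tree")
    if u is None:
        return None                            # the tree held only this signature
    w = u.left if u.right is v else u.right    # sibling of v
    if z is None:
        return w                               # u was the root: w becomes the root
    if z.left is u:                            # step (ii): z now points to w
        z.left = w
    else:
        z.right = w
    return root


# Small tree over s1 = 011001000101, s2 = 111011001111, s3 = 111101010111.
s1, s2, s3 = "011001000101", "111011001111", "111101010111"
root = Node(sk=1, left=Node(index=1),
            right=Node(sk=4, left=Node(index=2), right=Node(index=3)))
root = delete(root, s3, 3)
print(root.right.index)
root = delete(root, s1, 1)
print(root.index)
```

In both calls only the parent of the removed leaf disappears; no other part of the tree is
touched.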

From the above analysis, we see that the maintenance of a signature tree is an easy task.

4. Computational Complexity 
4.1 Time complexity 

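To analyze the performance of the signature tree, we consider four parameters: N, the number
of signatures in a signature file; F, the signature length; m, the number of bits set to 1 in a
signature; and D, the size of a block. When the average signature is half-populated with 1s, the
trade-off between false drop probability and storage overhead is optimized [CF84]. In such
a setting, the two parameters N and F satisfy the following inequality:

$$ N \le \binom{F}{F/2} \tag{1} $$

We have the above inequality based on the simple observation that if $N > \binom{F}{F/2}$, there
must exist two signatures having the same binary string. In this case, one of them will be
removed from the signature file.

In terms of Stirling's formula, $F! \approx \sqrt{2\pi F}\,(F/e)^F$, we have

$$ \binom{F}{F/2} = \frac{F!}{\left((F/2)!\right)^2} \approx \frac{\sqrt{2}\cdot 2^F}{\sqrt{\pi F}} \tag{2} $$

Then, we have

$$ N \le \frac{\sqrt{2}\cdot 2^F}{\sqrt{\pi F}} \tag{3} $$

From this, we have $\log_2 N \le \frac{1}{2} - \frac{1}{2}\log_2 \pi - \frac{1}{2}\log_2 F + F$.
Thus, F satisfies the following inequality:

$$ \log_2 N - \frac{1}{2} + \frac{1}{2}\log_2 \pi + \frac{1}{2}\log_2 F \le F \tag{4} $$

According to [CF84, DML98], in the case that the average block signature involves an equal
number of 1s and 0s, the three design parameters m, F, and D satisfy the relationship below:

$$ F \times \ln 2 = m \times D \tag{5} $$

In addition, on average l (the number of bits set to 1 in a query signature) is equal to m.

From the above, we derive the time complexity of the signature tree searching as follows:

$$ N/2^{l} \approx N/2^{m} = N/2^{\frac{F \ln 2}{D}} \tag{6} $$

In terms of (4) and (6), we have

$$ N/2^{\frac{F \ln 2}{D}} \le N/2^{\frac{\ln 2}{D}\left(\log_2 N - \frac{1}{2} + \frac{1}{2}\log_2 \pi + \frac{1}{2}\log_2 F\right)} $$

Finally, we have the inequality

$$ N/2^{\frac{F \ln 2}{D}} \le N \Big/ \left(N \cdot \tfrac{1}{\sqrt{2}} \cdot \sqrt{\pi} \cdot \sqrt{F}\right)^{\frac{\ln 2}{D}} \tag{7} $$

Applying (4) once more to the factor $\sqrt{F}$ gives

$$ N/2^{\frac{F \ln 2}{D}} \le N \Big/ \left(N \cdot \tfrac{1}{\sqrt{2}} \cdot \sqrt{\pi} \cdot \sqrt{F}\right)^{\frac{\ln 2}{D}} \le N \Big/ \left(N \cdot \tfrac{1}{\sqrt{2}} \cdot \sqrt{\pi} \cdot \sqrt{\log_2 N - \tfrac{1}{2} + \log_2 \sqrt{\pi} + \log_2 \sqrt{F}}\right)^{\frac{\ln 2}{D}} $$

In simplified form, this is written as

$$ N \big/ (N \log N)^{\frac{\ln 2}{D}} \tag{8} $$

Fig. 7 shows the calculation relating to the above formula.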




[Fig. 7: plot of the number of bit comparisons (vertical axis, ticks 2000 to 12000) against
the file size (horizontal axis, ticks 2000 to 12000), with one curve for signature file scanning
and one for signature tree searching.]

Fig. 7. Time complexities of signature file scanning and signature tree searching

In Fig. 7, the number of signatures checked is computed in terms of N, the size of a signature
file. From this, we can see that the performance of the signature tree searching degrades as the
size of a block increases. This is because, given a fixed signature length, a larger block requires
that fewer bits in a term signature be set to 1, which weakens the filtering power of signature
trees when it comes to single-term query processing.
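
The quantities used in this subsection are easy to evaluate numerically. The sketch below is an
illustration only, with hypothetical function names: it compares the exact count of F-bit strings
with F/2 ones against the Stirling estimate $\sqrt{2}\cdot 2^F/\sqrt{\pi F}$, and evaluates
$m = F \ln 2 / D$ and $N/2^m$ for given N, F and D.

```python
import math


def max_signatures(F: int) -> int:
    """Number of F-bit strings with exactly F/2 ones (the upper bound on N)."""
    return math.comb(F, F // 2)


def stirling_estimate(F: int) -> float:
    """Stirling-based approximation sqrt(2) * 2^F / sqrt(pi * F) of the same count."""
    return math.sqrt(2) * 2 ** F / math.sqrt(math.pi * F)


def bits_per_word(F: int, D: int) -> float:
    """m = F * ln 2 / D, from the relationship F x ln 2 = m x D."""
    return F * math.log(2) / D


def expected_comparisons(N: int, F: int, D: int) -> float:
    """N / 2^m with m taken from bits_per_word (query weight l assumed equal to m)."""
    return N / 2 ** bits_per_word(F, D)


for F in (12, 64, 256):
    ratio = stirling_estimate(F) / max_signatures(F)
    print(F, round(ratio, 4))

# The estimate shrinks as m grows, i.e. as the block size D gets smaller.
for D in (5, 10, 20):
    print(D, round(expected_comparisons(10_000, 64, D), 1))
```

The second loop mirrors the observation made above about Fig. 7: for a fixed signature length, a
larger block size D leaves fewer bits per word signature and hence a larger share of the tree to
visit.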

4.2 Extra space overhead of a signature tree 

Note that the signature tree is a binary tree. Thus, a signature tree can be stored as a set of
triples of the form <v, lp, rp>, where v represents the number associated with a node, lp
represents the pointer to the left subtree and rp represents the pointer to the right subtree.

Assume that the length of a signature is F and the number of signatures in a file is N. (The size
of the signature file is therefore N x F bits.) Then, for each v we need $\log_2 F$ bits and for
each lp (rp) we need $\log_2 N$ bits. Accordingly, for all the internal nodes of a signature tree,
we need $N \times \log_2 F + 2N \times \log_2 N$ bits of space. To reduce this overhead to some
extent, we use the following relative address encoding:

(1) The triples for a signature tree are stored in breadth-first order.

(2) lp and rp are relative addresses, i.e., the absolute address of the node v' (denoted add(v'))
pointed to by lp (or rp) is equal to add(v') = add(v) + lp (or add(v) + rp).

In this way, we need only 2 bits for the addresses of the nodes at the first level, 2 2 bits for the 
second level, 2 3 bits for the third level, and so on. The space overhead can then be reduced to 

k 

Nxlog 2 F + 2 JV-O + I) , 
where $2^k = N$. It is almost half of the size of the corresponding signature file.
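
For orientation, the estimate with absolute addresses, $N \times \log_2 F + 2N \times \log_2 N$
bits, can be compared with the file size $N \times F$ in a few lines of Python. This sketch covers
only the absolute-address estimate, not the relative encoding; rounding every field up to a whole
number of bits is our assumption, and the function names are hypothetical.

```python
import math


def signature_file_bits(N: int, F: int) -> int:
    """Size of the signature file: N signatures of F bits each."""
    return N * F


def tree_bits_absolute(N: int, F: int) -> int:
    """N * log2(F) + 2N * log2(N), with each field rounded up to whole bits (assumption)."""
    bits_for_position = math.ceil(math.log2(F))   # the number v stored in a node
    bits_for_pointer = math.ceil(math.log2(N))    # each of lp and rp
    return N * bits_for_position + 2 * N * bits_for_pointer


for N, F in ((8, 12), (1024, 64), (1_000_000, 512)):
    print(N, F, tree_bits_absolute(N, F), signature_file_bits(N, F))
```

The pointer term grows with $\log_2 N$, which is what the relative address encoding is meant to
reduce.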
5. Conclusion 

In this paper, a new concept of signature identifiers has been introduced, which can be used to
differentiate the signatures in a signature file from each other. Based on this concept, a tree
structure, called a signature tree, is proposed, in which each path from the root to a leaf node
corresponds to a signature identifier. Then, the scanning of a signature file can be replaced by
the traversal of a signature tree, which improves the query processing efficiency significantly.